Project Workflow: Census Demographics, Housing, and Geometries¶

This notebook documents the main analysis workflow for this project using preprocessed data. It shows how to load the curated datasets, perform representative cleaning and transformation steps, join demographics to geometries, and produce both static and interactive visualizations.

All file paths are relative to the project root. A classmate with this repository and the same environment can run the notebook from top to bottom and reproduce the results.

Setup Instructions¶

Before running this notebook:

  1. Create the conda environment:

    conda env create -f environment.yml
    
  2. Register the Jupyter kernel:

    bash setup_jupyter_kernel.sh
    
  3. Select the correct kernel:

    • Click on the kernel name in the top right of the notebook
    • Select "Python (cyplan-101-env)" from the dropdown menu
    • This ensures all packages (including fiona and geopandas) are available
  4. Verify data files exist:

    • Check that data/gold/ directory contains the required files
    • See the "Data sources" section below for expected files

Once the kernel is selected, you can run all cells from top to bottom.

In [1]:
import os
import sys

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

import folium

# Ensure plots appear inline in the notebook
%matplotlib inline

# Set a consistent random seed for any sampling operations
RANDOM_SEED = 42

# Define base directory for project data relative to this notebook
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
DATA_DIR = os.path.join(PROJECT_ROOT, "data")

print("Project root:", PROJECT_ROOT)
print("Data dir:", DATA_DIR)
Project root: /Users/admin/cyplan-101-proj1
Data dir: /Users/admin/cyplan-101-proj1/data

Data sources¶

This project uses preprocessed "gold" datasets that have already gone through earlier bronze and silver pipeline stages. In this notebook we focus on:

  • data/gold/demographics/census_demographics_dp05.csv
    Demographic indicators at the census tract level, including total population and age distribution.

  • data/gold/demographics/census_housing_s1901.csv
    Household income and related measures at the census tract level, including median household income.

  • data/gold/geometries/census_tract_boundaries.geojson
    Polygon geometries for census tracts in the study area. These include a geoid_full identifier used to link to tabular data.

  • data/gold/geometries/address_points.geoparquet
    Point locations representing address records. These provide context on where activity is concentrated.

  • data/gold/geometries/street_centerlines.geoparquet
    Line geometries representing street centerlines in the study area.

The goal is to connect tract-level demographic and housing data to the tract geometries and explore spatial patterns using maps and summary plots.

In [2]:
# Define paths to gold-layer datasets used in this notebook.
# Using relative paths tied to the project structure keeps the workflow reproducible.

demographics_path = os.path.join(DATA_DIR, "gold", "demographics", "census_demographics_dp05.csv")
housing_path = os.path.join(DATA_DIR, "gold", "demographics", "census_housing_s1901.csv")
tracts_path = os.path.join(DATA_DIR, "gold", "geometries", "census_tract_boundaries.geojson")
address_points_path = os.path.join(DATA_DIR, "gold", "geometries", "address_points.geoparquet")
streets_path = os.path.join(DATA_DIR, "gold", "geometries", "street_centerlines.geoparquet")

# Load tabular data from the gold layer. These tables contain one row per census tract.
demographics = pd.read_csv(demographics_path)
housing = pd.read_csv(housing_path)

# Load tract geometries and supporting geographic context using GeoPandas.
tracts = gpd.read_file(tracts_path)
address_points = gpd.read_parquet(address_points_path)
street_centerlines = gpd.read_parquet(streets_path)

print("Demographics rows:", len(demographics))
print("Housing rows:", len(housing))
print("Tracts rows:", len(tracts))
Demographics rows: 379
Housing rows: 379
Tracts rows: 378
In [3]:
# Inspect a few columns to confirm tract identifiers and key measures.
# The `geography` column contains a full census tract identifier (e.g., 1400000US06001400100).
# The `geoid_full` column on the tract geometries uses the same identifier pattern.

print("Demographics columns:\n", demographics.columns[:10])
print("Housing columns:\n", housing.columns[:10])
print("Tract columns:\n", tracts.columns)

# Show the first few rows of each dataset to ground the rest of the workflow.
demographics.head()
Demographics columns:
 Index(['geography', 'est_geographic_area_name',
       'est_sex_and_age_total_population', 'moe_sex_and_age_total_population',
       'est_sex_and_age_total_population_male',
       'moe_sex_and_age_total_population_male',
       'est_sex_and_age_total_population_female',
       'moe_sex_and_age_total_population_female',
       'est_sex_and_age_total_population_sex_ratio_males_per_100_females',
       'moe_sex_and_age_total_population_sex_ratio_males_per_100_females'],
      dtype='object')
Housing columns:
 Index(['geography', 'est_geographic_area_name', 'est_households_total',
       'moe_households_total', 'est_households_total_less_than_dollars10000',
       'moe_households_total_less_than_dollars10000',
       'est_households_total_dollars10000_to_dollars14999',
       'moe_households_total_dollars10000_to_dollars14999',
       'est_households_total_dollars15000_to_dollars24999',
       'moe_households_total_dollars15000_to_dollars24999'],
      dtype='object')
Tract columns:
 Index(['geoid_full', 'tract_name', 'tract_full_name', 'min_longitude',
       'max_longitude', 'min_latitude', 'max_latitude', 'center_longitude',
       'center_latitude', 'geometry'],
      dtype='object')
Out[3]:
geography est_geographic_area_name est_sex_and_age_total_population moe_sex_and_age_total_population est_sex_and_age_total_population_male moe_sex_and_age_total_population_male est_sex_and_age_total_population_female moe_sex_and_age_total_population_female est_sex_and_age_total_population_sex_ratio_males_per_100_females moe_sex_and_age_total_population_sex_ratio_males_per_100_females ... moe_percent_hispanic_or_latino_and_race_total_population_not_hispanic_or_latino_two_or_more_races_two_races_excluding_some_other_race_and_three_or_more_races est_percent_total_housing_units moe_percent_total_housing_units est_percent_citizen_voting_age_population_citizen_18_and_over_population moe_percent_citizen_voting_age_population_citizen_18_and_over_population est_percent_citizen_voting_age_population_citizen_18_and_over_population_male moe_percent_citizen_voting_age_population_citizen_18_and_over_population_male est_percent_citizen_voting_age_population_citizen_18_and_over_population_female moe_percent_citizen_voting_age_population_citizen_18_and_over_population_female unnamed:_378
0 1400000US06001400100 Census Tract 4001; Alameda County; California 3094 452 1552 334 1542 190 100.6 19.5 ... 2.4 (X) (X) 2256 (X) 47.2 4.3 52.8 4.3 NaN
1 1400000US06001400200 Census Tract 4002; Alameda County; California 2093 224 1081 180 1012 155 106.8 24.8 ... 2.6 (X) (X) 1636 (X) 51.9 6.1 48.1 6.1 NaN
2 1400000US06001400300 Census Tract 4003; Alameda County; California 5727 607 2818 487 2909 337 96.9 19.8 ... 4.6 (X) (X) 4443 (X) 47.5 5.0 52.5 5.0 NaN
3 1400000US06001400400 Census Tract 4004; Alameda County; California 4395 640 2001 346 2394 419 83.6 16.3 ... 1.9 (X) (X) 3183 (X) 46.5 5.3 53.5 5.3 NaN
4 1400000US06001400500 Census Tract 4005; Alameda County; California 3822 737 1885 281 1937 560 97.3 26.1 ... 4.7 (X) (X) 3056 (X) 44.9 7.2 55.1 7.2 NaN

5 rows × 379 columns

In [4]:
# Display a small sample from the housing and tract tables.
# This helps verify that the same tract identifier appears in both tabular and spatial data.

housing.head()
Out[4]:
geography est_geographic_area_name est_households_total moe_households_total est_households_total_less_than_dollars10000 moe_households_total_less_than_dollars10000 est_households_total_dollars10000_to_dollars14999 moe_households_total_dollars10000_to_dollars14999 est_households_total_dollars15000_to_dollars24999 moe_households_total_dollars15000_to_dollars24999 ... moe_nonfamily_households_median_income_dollars est_nonfamily_households_mean_income_dollars moe_nonfamily_households_mean_income_dollars est_nonfamily_households_percent_allocated_household_income_in_the_past_12_months moe_nonfamily_households_percent_allocated_household_income_in_the_past_12_months est_nonfamily_households_percent_allocated_family_income_in_the_past_12_months moe_nonfamily_households_percent_allocated_family_income_in_the_past_12_months est_nonfamily_households_percent_allocated_nonfamily_income_in_the_past_12_months moe_nonfamily_households_percent_allocated_nonfamily_income_in_the_past_12_months unnamed:_130
0 1400000US06001400100 Census Tract 4001; Alameda County; California 1316 142 6.0 5.8 0.0 3.2 1.8 2.9 ... 97096 225423 62111 (X) (X) (X) (X) 20.6 (X) NaN
1 1400000US06001400200 Census Tract 4002; Alameda County; California 861 86 1.4 1.7 1.7 1.5 1.6 1.5 ... 41387 153186 24879 (X) (X) (X) (X) 31.5 (X) NaN
2 1400000US06001400300 Census Tract 4003; Alameda County; California 2713 273 4.4 3.8 3.5 2.9 3.6 3.2 ... 18869 126792 23123 (X) (X) (X) (X) 31.3 (X) NaN
3 1400000US06001400400 Census Tract 4004; Alameda County; California 1803 173 0.4 0.8 0.4 0.7 7.0 4.2 ... 41855 145833 23424 (X) (X) (X) (X) 35.4 (X) NaN
4 1400000US06001400500 Census Tract 4005; Alameda County; California 1655 283 1.5 1.5 2.2 2.0 8.8 4.7 ... 17894 105792 17797 (X) (X) (X) (X) 28.4 (X) NaN

5 rows × 131 columns

In [5]:
tracts.head()
Out[5]:
geoid_full tract_name tract_full_name min_longitude max_longitude min_latitude max_latitude center_longitude center_latitude geometry
0 1400000US06001438201 4382.01 Census Tract 4382.01 -122.084513 -122.071712 37.617563 37.631558 -122.076439 37.626268 POLYGON ((-122.07 37.63, -122.07 37.63, -122.0...
1 1400000US06001440333 4403.33 Census Tract 4403.33 -122.087429 -122.061052 37.596433 37.613476 -122.073828 37.605330 POLYGON ((-122.07 37.61, -122.07 37.61, -122.0...
2 1400000US06001441524 4415.24 Census Tract 4415.24 -122.059295 -122.038506 37.549454 37.569413 -122.050693 37.559478 POLYGON ((-122.05 37.57, -122.05 37.57, -122.0...
3 1400000US06001444200 4442 Census Tract 4442 -122.055180 -122.033659 37.533296 37.551742 -122.045728 37.542543 POLYGON ((-122.03 37.54, -122.03 37.54, -122.0...
4 1400000US06001444304 4443.04 Census Tract 4443.04 -122.070890 -122.005966 37.478609 37.534691 -122.035882 37.508137 POLYGON ((-122.04 37.53, -122.04 37.53, -122.0...

Creating a shared tract identifier¶

The gold-layer tables use two different identifier fields for census tracts:

  • The demographic and housing tables use a geography string such as 1400000US06001400100.
  • The tract geometry file stores a geoid_full field with the same pattern.

In this section we create a simplified geoid column in each table by removing the common 1400000US prefix. This provides a clean key for joining tabular attributes to tract geometries.

In [6]:
# Create a simplified geoid key shared across demographics, housing, and tracts.
# The original identifiers start with '1400000US'; removing this prefix leaves the
# underlying state + county + tract code.

prefix = "1400000US"

demographics["geoid"] = demographics["geography"].str.replace(prefix, "", regex=False)
housing["geoid"] = housing["geography"].str.replace(prefix, "", regex=False)

if "geoid_full" in tracts.columns:
    tracts["geoid"] = tracts["geoid_full"].str.replace(prefix, "", regex=False)
else:
    raise KeyError("Expected 'geoid_full' column in tract geometries.")

# Verify that the same geoid values appear in all three tables.
print("Unique geoids in demographics:", demographics["geoid"].nunique())
print("Unique geoids in housing:", housing["geoid"].nunique())
print("Unique geoids in tracts:", tracts["geoid"].nunique())
Unique geoids in demographics: 379
Unique geoids in housing: 379
Unique geoids in tracts: 378
In [7]:
# Select a focused set of demographic variables for this notebook.
# These come from the DP05 table and will be used to illustrate joins and summaries.

demo_columns = [
    "geoid",
    "est_sex_and_age_total_population",
    "est_race_total_population_one_race_white",
    "est_race_total_population_one_race_black_or_african_american",
    "est_race_total_population_one_race_asian",
    "est_hispanic_or_latino_and_race_total_population_hispanic_or_latino_of_any_race",
]

demographics_small = demographics[demo_columns].copy()

demographics_small.head()
Out[7]:
geoid est_sex_and_age_total_population est_race_total_population_one_race_white est_race_total_population_one_race_black_or_african_american est_race_total_population_one_race_asian est_hispanic_or_latino_and_race_total_population_hispanic_or_latino_of_any_race
0 06001400100 3094 2246 137 462 200
1 06001400200 2093 1423 71 256 196
2 06001400300 5727 3478 524 609 497
3 06001400400 4395 2742 457 422 604
4 06001400500 3822 1773 919 312 557
In [8]:
# Select the tract identifier and median household income from the housing table.
# This keeps the notebook focused on one income measure that is easy to interpret.

income_column = "est_households_median_income_dollars"

housing_small = housing[["geoid", income_column]].copy()

# Convert income column to numeric, handling special cases like "250,000+"
def convert_income_to_numeric(value):
    """Convert income string to numeric value."""
    if pd.isna(value):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Handle string values like "250,000+"
    value_str = str(value).strip()
    if value_str.endswith("+"):
        # For "250,000+", convert to 250000
        value_str = value_str[:-1].replace(",", "")
        return float(value_str)
    # Remove commas and convert to float
    value_str = value_str.replace(",", "")
    try:
        return float(value_str)
    except ValueError:
        return None

housing_small["median_income_numeric"] = housing_small[income_column].apply(convert_income_to_numeric)

# Drop tracts where median income is not reported to avoid ambiguous values.
before_rows = len(housing_small)
housing_small = housing_small.dropna(subset=["median_income_numeric"])
after_rows = len(housing_small)

print(f"Dropped {before_rows - after_rows} tracts with missing median household income.")

housing_small.head()
Dropped 6 tracts with missing median household income.
Out[8]:
geoid est_households_median_income_dollars median_income_numeric
0 06001400100 250,000+ 250000.0
1 06001400200 225880 225880.0
2 06001400300 157731 157731.0
3 06001400400 159612 159612.0
4 06001400500 96250 96250.0

Joining attributes to tract geometries¶

With a common geoid key in place, we can:

  1. Join the demographic and housing tables into a single tract-level attribute table.
  2. Attach those attributes to the tract polygons.

This produces a GeoDataFrame that is ready for both statistical summaries and map-based visualizations.

In [9]:
# Join demographics and housing on the shared geoid key.
# A left join preserves all tracts that appear in the demographic table.

tract_attributes = demographics_small.merge(
    housing_small[["geoid", income_column, "median_income_numeric"]],
    on="geoid",
    how="left",
    validate="one_to_one",
)

tract_attributes.head()
Out[9]:
geoid est_sex_and_age_total_population est_race_total_population_one_race_white est_race_total_population_one_race_black_or_african_american est_race_total_population_one_race_asian est_hispanic_or_latino_and_race_total_population_hispanic_or_latino_of_any_race est_households_median_income_dollars median_income_numeric
0 06001400100 3094 2246 137 462 200 250,000+ 250000.0
1 06001400200 2093 1423 71 256 196 225880 225880.0
2 06001400300 5727 3478 524 609 497 157731 157731.0
3 06001400400 4395 2742 457 422 604 159612 159612.0
4 06001400500 3822 1773 919 312 557 96250 96250.0
In [10]:
# Attach the combined attributes to tract geometries.
# This creates a GeoDataFrame with both spatial and tabular information.

tracts_with_data = tracts.merge(
    tract_attributes,
    on="geoid",
    how="left",
    validate="one_to_one",
)

tracts_with_data[["geoid", "est_sex_and_age_total_population", income_column, "median_income_numeric"]].head()
Out[10]:
geoid est_sex_and_age_total_population est_households_median_income_dollars median_income_numeric
0 06001438201 4671 107500 107500.0
1 06001440333 2516 127708 127708.0
2 06001441524 3989 250,000+ 250000.0
3 06001444200 6271 137773 137773.0
4 06001444304 6809 224511 224511.0
In [11]:
# Create a simple population density measure using tract area.
# A projected CRS is required so that area is measured in meters rather than degrees.

if tracts_with_data.crs is None or tracts_with_data.crs.is_geographic:
    tracts_projected = tracts_with_data.to_crs(epsg=3857)
else:
    tracts_projected = tracts_with_data.to_crs(epsg=3857)

tracts_projected["area_sqkm"] = tracts_projected.geometry.area / 1_000_000

# Avoid division by zero by excluding tracts with zero area.
tracts_projected = tracts_projected[tracts_projected["area_sqkm"] > 0].copy()

tracts_projected["pop_density"] = (
    tracts_projected["est_sex_and_age_total_population"]
    / tracts_projected["area_sqkm"]
)

tracts_projected[["geoid", "area_sqkm", "pop_density", "median_income_numeric"]].head()
Out[11]:
geoid area_sqkm pop_density median_income_numeric
0 06001438201 1.564604 2985.419910 107500.0
1 06001440333 1.564183 1608.507178 127708.0
2 06001441524 3.126476 1275.877314 250000.0
3 06001444200 3.125637 2006.310819 137773.0
4 06001444304 31.243075 217.936291 224511.0
In [12]:
# Summarize the distribution of median household income and population density.

income_summary = tracts_projected["median_income_numeric"].describe()
density_summary = tracts_projected["pop_density"].describe()

print("Median household income summary:\n", income_summary)
print("\nPopulation density (persons per sq km) summary:\n", density_summary)
Median household income summary:
 count       300.000000
mean     142180.023333
std       55847.475205
min       30833.000000
25%      101490.500000
50%      131883.000000
75%      182496.000000
max      250000.000000
Name: median_income_numeric, dtype: float64

Population density (persons per sq km) summary:
 count     304.000000
mean     1840.097771
std      1107.378718
min         0.000000
25%      1024.996900
50%      1735.252093
75%      2568.341620
max      5850.215199
Name: pop_density, dtype: float64

Visualizing income and population patterns¶

The joined GeoDataFrame supports both numeric summaries and map-based views. In this section we:

  • Plot histograms of median household income and population density.
  • Create a static choropleth map of median income by census tract.

These visualizations help reveal how economic conditions and population concentrations vary across the county.

In [13]:
# Histogram of median household income across tracts.
# Use the numeric column for plotting.

fig, ax = plt.subplots(figsize=(8, 4))
tracts_projected["median_income_numeric"].dropna().plot.hist(
    bins=30,
    ax=ax,
    edgecolor="black",
)
ax.set_title("Distribution of Median Household Income by Census Tract")
ax.set_xlabel("Median Household Income (dollars)")
ax.set_ylabel("Number of Tracts")
plt.tight_layout()
No description has been provided for this image
In [14]:
# Histogram of population density to highlight areas of high and low concentration.

fig, ax = plt.subplots(figsize=(8, 4))
tracts_projected["pop_density"].dropna().plot.hist(
    bins=30,
    ax=ax,
    edgecolor="black",
)
ax.set_title("Distribution of Population Density by Census Tract")
ax.set_xlabel("Population per square kilometer")
ax.set_ylabel("Number of Tracts")
plt.tight_layout()
No description has been provided for this image
In [15]:
# Static choropleth map of median household income by tract.
# This uses the tract polygons to visualize spatial variation.

tracts_for_plot = tracts_projected.to_crs(epsg=4326)

fig, ax = plt.subplots(figsize=(8, 8))
tracts_for_plot.plot(
    column="median_income_numeric",
    cmap="viridis",
    linewidth=0.2,
    edgecolor="gray",
    legend=True,
    ax=ax,
)
ax.set_axis_off()
ax.set_title("Median Household Income by Census Tract")
plt.tight_layout()
No description has been provided for this image

Interactive map of income and address density¶

To make the spatial patterns easier to explore, we now create an interactive web map. The map combines:

  • A choropleth of median household income by census tract.
  • A sampled set of address points to show where activity is concentrated.

The map uses the geoid key and tract geometries already prepared above, so the workflow from raw tables to interactive visualization is fully documented.

In [16]:
# Reproject tract and address geometries to WGS84 for web mapping.

tracts_wgs84 = tracts_projected.to_crs(epsg=4326)
address_points_wgs84 = address_points.to_crs(epsg=4326)

# Compute a map center from tract centroids.
centroids = tracts_wgs84.geometry.centroid
center_lat = centroids.y.mean()
center_lon = centroids.x.mean()

m = folium.Map(location=[center_lat, center_lon], zoom_start=11, tiles="CartoDB positron")

# Add a choropleth of median household income by tract.
# Use the numeric column for the choropleth
folium.Choropleth(
    geo_data=tracts_wgs84.to_json(),
    data=tracts_wgs84,
    columns=["geoid", "median_income_numeric"],
    key_on="feature.properties.geoid",
    fill_color="YlGnBu",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Median Household Income (dollars)",
).add_to(m)

# Sample address points to keep rendering performance reasonable.
sample_size = min(2000, len(address_points_wgs84))
address_sample = address_points_wgs84.sample(sample_size, random_state=RANDOM_SEED)

for _, row in address_sample.iterrows():
    folium.CircleMarker(
        location=[row.geometry.y, row.geometry.x],
        radius=1,
        color="black",
        weight=0,
        fill=True,
        fill_opacity=0.4,
    ).add_to(m)

m
/var/folders/zf/g_wymr6j4pd1lz7svfjqwq9r0000gn/T/ipykernel_48922/601200033.py:7: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  centroids = tracts_wgs84.geometry.centroid
Out[16]:
Make this Notebook Trusted to load map: File -> Trust Notebook

How to rerun this notebook¶

To reproduce the analysis:

  1. Create and activate the project environment using the configuration in environment.yml.
  2. Start JupyterLab or Jupyter Notebook from the project root directory.
  3. Open notebooks/project_workflow.ipynb.
  4. Run all cells in order from top to bottom.

All file paths used in this notebook are relative to the repository structure, and the notebook operates on the preprocessed gold datasets under data/gold. Runtime should remain under approximately 15 minutes on a typical laptop because the notebook works from existing outputs rather than recomputing earlier pipeline stages.

Summary¶

This notebook:

  • Loads gold-layer demographic, housing, and geometry data for Alameda County census tracts.
  • Creates a shared geoid key to join attributes to tract geometries.
  • Constructs simple derived measures such as tract area and population density.
  • Visualizes spatial patterns in median household income using both static plots and an interactive Folium map.

A classmate can follow the comments and markdown cells to understand each step of the workflow and reproduce the same figures from the shared project data.